The dataset covers the red variant of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
Quality grades distribution:
##
##   3   4   5   6   7   8
##  10  53 681 638 199  18
As the summary statistics above show, no wine has a quality value lower than 3 or higher than 8. The quality value is also discrete, and should be treated as an ordinal variable.
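Treating quality as ordinal can be sketched in base R. The snippet below is a minimal illustration, not the code used for this report: it rebuilds the quality column from the grade counts printed above (so it runs without the raw file) and converts it to an ordered factor. The names `grade_counts`, `quality_num` and `quality` are my own.

```r
# Grade counts copied from the frequency table above.
grade_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the quality column: each grade repeated as often as it was observed.
quality_num <- rep(as.integer(names(grade_counts)), times = grade_counts)

# Ordered factor: the levels keep their rank (3 < 4 < ... < 8),
# but arithmetic such as mean() is no longer defined on them.
quality <- factor(quality_num, levels = 3:8, ordered = TRUE)

length(quality)        # number of wines
mean(quality_num)      # the mean only makes sense on the numeric copy
median(quality_num)
sum(quality >= "7")    # ordered factors support rank comparisons
```

Keeping a numeric copy next to the ordered factor is convenient, because the correlation tests further down call `as.numeric(quality)`.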
Let’s start by looking at the distributions of some of the variables:
At first glance, quality has a roughly normal distribution, as do pH and density. The distributions of all the other attributes are mostly right-skewed, which suggests consistently low values for those attributes.
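The skew can also be read off the summary table without the plots. As a rough heuristic (my own choice, not a formal skewness test), the gap between mean and median, scaled by the interquartile range, is close to zero for a symmetric variable and clearly positive for a right-skewed one. The values below are copied from the summary above for a few of the variables; the data frame name `stats` and the column `gap` are hypothetical.

```r
# Mean, median and quartiles copied from the summary output above.
stats <- data.frame(
  variable = c("pH", "density", "residual.sugar", "chlorides",
               "total.sulfur.dioxide", "sulphates", "alcohol"),
  q1     = c(3.210, 0.9956, 1.900, 0.07000, 22.00, 0.5500,  9.50),
  median = c(3.310, 0.9968, 2.200, 0.07900, 38.00, 0.6200, 10.20),
  mean   = c(3.311, 0.9967, 2.539, 0.08747, 46.47, 0.6581, 10.42),
  q3     = c(3.400, 0.9978, 2.600, 0.09000, 62.00, 0.7300, 11.10)
)

# Scaled mean-median gap: about 0 for symmetric data, > 0 for right skew.
stats$gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Sort from most right-skewed to most symmetric.
stats[order(stats$gap, decreasing = TRUE), c("variable", "gap")]
```

With these numbers, pH and density land near zero while residual sugar and chlorides show the largest positive gaps, which matches the visual impression.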
Specifically, let’s take a look at the sulfite and sulphate distributions:
As we can see, most wines have a low concentration of these compounds, with only a few having higher amounts of sulphates and sulfites.
There are 1599 wines in the dataset with 12 features (as seen above). The quality variable is discrete and is graded on a scale from 0 to 10, but in this dataset the minimum is 3 and the maximum is 8.
The main feature of interest in this dataset is the quality.
The other features will be used to investigate their influence on the perceived quality of the wine, especially those that relate to perceived flavor (such as volatile acidity, residual sugar and chlorides).
The presence of sulphates and SO2 (sulfites) is also evaluated. Sulfites are generated by the fermentation and aging processes, and may taint the wine’s flavor. One common way to balance this effect is to add a sulphate, usually copper sulfate (CuSO4), to reduce the formation of sulfites. I will investigate how the levels of sulfites and sulphates affect the perceived quality of the wine.
Correlation between Density and Alcohol level:
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
Here, we start by evaluating how the alcohol percentage relates to the density. The graph and the correlation coefficient are consistent with the following:
- Beverages are mostly water;
- Water is denser than alcohol;
- Higher percentages of alcohol make a wine less dense.
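The figures in the test output above follow from the correlation coefficient and the sample size alone. The sketch below shows the standard arithmetic behind a Pearson correlation test: the t statistic is r·sqrt(n − 2)/sqrt(1 − r²), and the confidence interval comes from the Fisher z-transformation. The function name `pearson_summary` is hypothetical; on the raw data, `cor.test()` does all of this in one call.

```r
# Recompute t, p-value and confidence interval from r and n only.
pearson_summary <- function(r, n, conf_level = 0.95) {
  df <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)        # t statistic with n - 2 df
  p_value <- 2 * pt(-abs(t_stat), df)           # two-sided p-value
  z <- atanh(r)                                 # Fisher z-transformation
  half_width <- qnorm(1 - (1 - conf_level) / 2) / sqrt(n - 3)
  conf_int <- tanh(c(z - half_width, z + half_width))
  list(t = t_stat, df = df, p_value = p_value, conf_int = conf_int)
}

# Density vs. alcohol, using the estimate printed above.
res <- pearson_summary(r = -0.4961798, n = 1599)
round(res$t, 3)
round(res$conf_int, 4)
```

The rounded results should agree with the t value and the 95 percent interval reported in the output above, and the same function can be reused for the quality correlations that follow.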
Correlation between Quality and Alcohol level:
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
We can see that the best-rated wines have consistently higher levels of alcohol, and that quality grade 5 has a large number of outliers. The correlation test points in the same direction.
Correlation between Quality and pH level:
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
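We can see that the pH level has little to no relationship with the perceived quality of the wine: the correlation is statistically significant at the 5% level (p = 0.021) but very weak (r ≈ -0.06).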
Correlation between Quality and Citric Acid level:
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The graph and the calculations seem to indicate a weak positive correlation between the amount of citric acid and the perceived quality of the wine.
Correlation between Quality and Residual Sugar level:
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
The quality also does not seem to be affected by the residual sugar, but there are several outliers in this case.
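The presence of outliers can be checked against the usual boxplot rule, which flags points more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule to the residual sugar quartiles from the summary table for the whole dataset. It is an approximation, since boxplots drawn per quality grade use each group's own hinges; the variable names are my own.

```r
# Residual sugar summary values copied from the table above.
rs_min <- 0.900
rs_q1  <- 1.900
rs_q3  <- 2.600
rs_max <- 15.500

iqr <- rs_q3 - rs_q1                 # interquartile range
lower_fence <- rs_q1 - 1.5 * iqr     # values below this count as outliers
upper_fence <- rs_q3 + 1.5 * iqr     # values above this count as outliers

c(lower = lower_fence, upper = upper_fence)
rs_min < lower_fence   # any outliers on the low side?
rs_max > upper_fence   # any outliers on the high side?
```

The maximum sits far above the upper fence, so the outliers are all on the high side, which fits a right-skewed variable.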
Correlation between Quality and Sulphates level:
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The quality seems to be slightly affected by the presence of sulphates, and there are several outliers in the 5-6 quality range.
Correlation between Quality and Volatile Acidity level:
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The volatile acidity is negatively correlated with the quality: the less, the better.
Our main investigation was about how the quality is affected by several attributes in the dataset. As can be seen above, some variables are positively associated with the perceived quality (alcohol, sulphates), some negatively (volatile acidity), and some seem not to affect it at all (residual sugar).
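To compare these relationships side by side, the correlation estimates reported above can be collected and ranked by absolute value. This is only a convenience sketch; the vector name `r_quality` is hypothetical. Squaring each coefficient gives the share of variance in the numeric quality grade that a straight-line fit on that variable alone would account for.

```r
# Pearson estimates against as.numeric(quality), copied from the outputs above.
r_quality <- c(
  alcohol          =  0.4761663,
  pH               = -0.05773139,
  citric.acid      =  0.2263725,
  residual.sugar   =  0.01373164,
  sulphates        =  0.2513971,
  volatile.acidity = -0.3905578
)

# Strongest relationships first, regardless of sign.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

data.frame(r = round(ranked, 3), r_squared = round(ranked^2, 3))
```

Even the strongest single variable, alcohol, explains well under a third of the variance, so none of these attributes determines the grade on its own.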
The relationship between density and alcohol level is moderately strong (r ≈ -0.50): the more alcohol, the less dense the wine. This makes sense: like any beverage, wine is mostly water, and since alcohol is less dense than water, more alcohol means lower overall density.
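The direction of this relationship can be illustrated with a back-of-the-envelope calculation. The sketch below assumes an ideal water-ethanol mixture, that the alcohol variable is a percentage by volume, and textbook densities of about 0.998 g/cm³ for water and 0.789 g/cm³ for ethanol near 20 °C. None of these assumptions come from the dataset, and the function name `ideal_density` is hypothetical.

```r
# Assumed textbook densities near 20 C, in g/cm^3 (not taken from the dataset).
rho_water   <- 0.998
rho_ethanol <- 0.789

# Volume-weighted density of an ideal water-ethanol mixture.
# Ignores volume contraction on mixing and every other component of wine.
ideal_density <- function(abv_percent) {
  frac <- abv_percent / 100
  (1 - frac) * rho_water + frac * rho_ethanol
}

abv <- c(min = 8.4, median = 10.2, max = 14.9)   # alcohol values from the summary
round(ideal_density(abv), 4)

# Change in density per additional percentage point of alcohol.
slope <- (rho_ethanol - rho_water) / 100
slope
```

The observed densities (0.9901 to 1.0037) are higher than these idealized values, presumably because wine also contains dissolved sugars, acids and salts, which would also help explain why the correlation is moderate rather than near-perfect.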
The strongest relationship I found was between density and alcohol percentage. Among the relationships involving wine quality, the strongest was between alcohol percentage and perceived quality.
The amount of free SO2 is consistent with the amount of total SO2 in the studied wines. The number of wines showing a high level of sulphates is small.
Consistent with the previous graph, we can see that most wines have small concentrations of sulphates, and that this does not considerably affect the total SO2.
We can see a strong correlation between the pH level and the amount of citric acid in the wine. However, the pH level seems to affect the perceived quality less than the citric acid does.
As we can see, most wines have low levels of both free and total SO2, and seem to use small amounts of sulphates. Few wines have high levels of sulphates, which may indicate good control over the aging process by the producers.
Also, the expected relation between pH and citric acid is present (lower pH = higher acidity). We can also see that the quality of the wine is affected by the level of citric acid, but not so much by the pH level.
I was expecting to see smaller levels of added sulphates in higher-quality wines, which could indicate a better or more traditional aging process. The opposite seems to be true: higher-quality wines have higher sulphate levels.
As seen in the data, there’s a weak correlation between the perceived quality and the citric acid level. Interestingly, this relationship does not extend to the pH level: not all wines with a high citric acid level have a low pH.
The residual sugar does not seem to affect the perceived quality. This is somewhat interesting, because vinho verde is not a sweet wine (which could have led the sweeter ones to be rated poorly).
Of all the parameters evaluated, this one seems the most interesting to me: the correlation between quality and alcohol level. There’s a moderately strong correlation (r ≈ 0.48) between the alcohol level and the perceived quality of the wine. Maybe the reviewers were more interested in the effects of wine than in its flavors? :)
The redwine dataset contains 1599 observations across 12 variables [2], collected sometime around 2009 for the red vinho verde wine. My initial approach was to look at the variable names and their summary statistics, to identify interesting values for study.
My main focus was the quality variable, which is defined as the median of at least three evaluations made by wine experts. I tried to investigate which chemical characteristics were consistent with the grades.
The main challenge I found was understanding some of the chemical processes used in wine production, for instance the addition of sulphates to improve the aging process. I was also under the impression that sweeter wines would be rated worse than drier ones, and the data does not support this point of view.
Several factors seem to affect the perceived quality, some of them positively, some not. Among the positive ones, alcohol was the most prominent, followed by sulphates and citric acid, while volatile acidity negatively affects quality.
Some limitations need to be considered: the reviews were made by a small set of reviewers, there is no information about the methodology adopted in these reviews, and the dataset only covers a specific kind of wine, the red variety of vinho verde, without considering the differences within it (type of grape, for instance).
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016 (Elsevier); http://www3.dsi.uminho.pt/pcortez/dss09.bib (bib)
[2] The first variable in the dataset (X) is just a sequential ID for each observation, and was ignored.